When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future.
library(modeltime)
library(tidymodels)
library(tidyverse)
library(timetk)
library(lubridate)
library(glmnet)
The dataset we will be using is bike_sharing_daily, a built-in dataset from the timetk package.
We will use only two variables from the dataset: the dteday and cnt columns. cnt is the response variable.
View(bike_sharing_daily)
mydata <- bike_sharing_daily %>%
select(dteday, cnt)
mydata
This displays an interactive time series chart of the stated variables.
mydata %>% plot_time_series(dteday, cnt)
We will split the dataset into training and testing sets, using the last 3 months of the data as the testing set.
splits <- time_series_split(mydata, assess = "3 months",
cumulative = TRUE)
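As a rough sketch of what time_series_split(assess = "3 months", cumulative = TRUE) is doing: the last ~3 months of dates become the assessment (test) window, and because cumulative = TRUE, every earlier observation is kept for training. The base-R illustration below uses the dataset's known date span (2011-01-01 to 2012-12-31) and approximates "3 months" as 90 days; it is an illustration of the idea, not the function's internals.

```r
# Illustrative sketch (base R only): hold out the last ~3 months as the
# test window and use all earlier dates for training (cumulative window).
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
cutoff <- max(dates) - 90          # roughly "3 months" before the last date
train_idx <- dates <= cutoff       # cumulative = TRUE: all history up to cutoff
test_idx  <- dates >  cutoff       # assessment window

c(train = sum(train_idx), test = sum(test_idx))
```

This matches the 90-row .calibration_data tibbles we see later after calibrating on the test set.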
Visualizing the training and testing data
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(dteday, cnt)
mymodel <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(cnt ~ dteday, training(splits))
ourmodel <- prophet_reg(seasonality_yearly = TRUE) %>%
set_engine("prophet") %>%
fit(cnt ~ dteday, training(splits))
yrmodel <- linear_reg(penalty = 0.01) %>%
set_engine("glmnet") %>%
fit(cnt ~ wday(dteday, label = TRUE) +
month(dteday, label = TRUE) +
as.numeric(dteday), training(splits))
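Unlike ARIMA and Prophet, glmnet is not a time-series model, so the formula hands it calendar features derived from the date: day of week, month, and a numeric trend. A self-contained base-R sketch of what those predictors look like (using weekdays() and months() in place of lubridate's wday() and month(), purely for illustration):

```r
# Calendar predictors derived from a date column, in the spirit of the
# glmnet formula above (weekdays()/months() stand in for lubridate here).
dates <- as.Date(c("2011-01-01", "2011-01-02", "2011-06-15"))

features <- data.frame(
  wday  = weekdays(dates),   # day-of-week effect
  month = months(dates),     # yearly seasonality
  trend = as.numeric(dates)  # days since 1970-01-01: a linear trend term
)
features
```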
Create a Modeltime Table (this helps to organize our models)
hismodel <- modeltime_table(mymodel, ourmodel, yrmodel)
hismodel
## # Modeltime Table
## # A tibble: 3 x 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(0,1,3) WITH DRIFT
## 2 2 <fit[+]> PROPHET
## 3 3 <fit[+]> GLMNET
Here we calculate predictions and residuals (errors) for the test data.
mycalib <- hismodel %>%
modeltime_calibrate(testing(splits))
mycalib
## # Modeltime Table
## # A tibble: 3 x 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <fit[+]> ARIMA(0,1,3) WITH DRIFT Test <tibble [90 x 4]>
## 2 2 <fit[+]> PROPHET Test <tibble [90 x 4]>
## 3 3 <fit[+]> GLMNET Test <tibble [90 x 4]>
Let's see the accuracy of our testing dataset predictions.
mae - Mean Absolute Error (the average absolute error across the predictions; the smaller the value, the better): we see that the ARIMA model is the worst.
rsq - R-Squared value (its value ranges from 0 to 1; the higher the value, the better): we see the PROPHET model is the best.
mycalib %>% modeltime_accuracy()
## # A tibble: 3 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(0,1,3) WITH DRIFT Test 2540. 475. 2.74 46.0 3188. 0.390
## 2 2 PROPHET Test 1220. 365. 1.32 28.7 1763. 0.437
## 3 3 GLMNET Test 1289. 373. 1.39 29.7 1835. 0.247
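Both metrics can be reproduced by hand. A minimal base-R sketch on made-up actual/predicted vectors (the numbers below are illustrative, not taken from the models above); for rsq we use the squared correlation, which is how yardstick (used under the hood by modeltime_accuracy) defines it:

```r
# Hand-computed mae and rsq on toy vectors (values are illustrative only)
actual    <- c(100, 150, 200, 250, 300)
predicted <- c(110, 140, 210, 240, 320)

mae <- mean(abs(actual - predicted))   # average absolute error
rsq <- cor(actual, predicted)^2        # squared correlation, in [0, 1]

c(mae = mae, rsq = rsq)
```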
This displays an interactive time series chart of the three models on the training and testing data.
We see that the ARIMA model does not follow the time series trend, unlike the other models, so you can disable ARIMA in the legend bar to remove it from the chart.
mycalib %>%
modeltime_forecast(new_data = testing(splits),
actual_data = mydata) %>%
plot_modeltime_forecast()
We will forecast the next 3 months.
We re-train the three models on the full dataset (not just the training data) so we get the most accurate predictions for the future.
myforecast <- mycalib %>%
modeltime_refit(mydata) %>%
modeltime_forecast(h = "3 months", actual_data = mydata)
myforecast %>%
plot_modeltime_forecast()